fix: correct OutboxGCWorkflow timeout and tighten schedule config #2947

Open: disintegrator wants to merge 4 commits into main from outbox-gc-tuning

@disintegrator
Contributor

The workflow was timing out on every run. After GCOutboxProcessedRows returned 0 eligible rows it called workflow.Sleep(1h), but the schedule set WorkflowRunTimeout to 15 minutes, so every run was terminated by the timeout before the sleep could complete.

Changes:

  • Remove the in-workflow sleep entirely. The workflow now returns nil once a partial batch confirms no further rows remain; the Temporal schedule handles re-triggering.
  • Tighten timeouts to match actual workload: schedule interval 6h→5min, activity StartToCloseTimeout 10min→1min, WorkflowRunTimeout 15min→2min. At current volume (~40K rows steady-state, ~4 rows/min arriving) each run deletes ~20 rows in a single activity call that completes in milliseconds.
  • Make AddOutboxGCSchedule upsert: on ErrScheduleAlreadyRunning it now calls handle.Update to push the new spec and action to the existing schedule, so config changes take effect on deploy without manual intervention via the Temporal UI.

DB impact: the more frequent schedule means terminal rows are cleaned up within ~5 minutes of crossing the 7-day retention threshold instead of up to 6 hours late. Each run issues at most one batched DELETE of ≤100 rows, well within autovacuum's ability to reclaim dead tuples at this volume.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@disintegrator disintegrator requested a review from a team as a code owner May 20, 2026 12:55

@claude claude Bot left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@vercel
vercel Bot commented May 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| gram-docs-redirect | Ready | Preview, Comment | May 20, 2026 2:32pm |

@changeset-bot
changeset-bot Bot commented May 20, 2026

⚠️ No Changeset found

Latest commit: b6a4854

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Member

@bflad bflad left a comment

Otherwise, looks good to me 🚀

Comment thread server/internal/background/outbox_gc.go
Comment thread server/internal/background/outbox_gc.go
@github-actions github-actions Bot added the preview label (Spawn a preview environment) May 20, 2026
@speakeasybot
Collaborator

speakeasybot commented May 20, 2026

🚀 Preview Environment (PR #2947)

Preview URL: https://pr-2947.dev.getgram.ai

| Component | Status | Details | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Database | Ready | Existing database reused | 2026-05-21 08:09:21 |
| ✅ Images | Available | Container images ready | 2026-05-21 08:07:56 |

Gram Preview Bot

@disintegrator disintegrator enabled auto-merge May 20, 2026 14:34
@disintegrator disintegrator added this pull request to the merge queue May 20, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 20, 2026
@blacksmith-sh
Contributor

blacksmith-sh Bot commented May 20, 2026

Found 1 test failure on Blacksmith runners:

Failed test: github.com/speakeasy-api/gram/server/internal/tools/TestMain

3 participants